[57] ABSTRACT

Multiple nodes can concurrently gain membership in a 
cluster of nodes of a distributed computer system by broad- 
casting reconfiguration messages to all nodes of the distrib- 
uted computer system. In response to a reconfiguration 
request resulting from a node petitioning to join a cluster or 
a node leaving the cluster, each node determines to which 
nodes of the distributed computer system the node is 
connected, i.e., which are sending reconfiguration messages 
which the node receives. In addition, if multiple nodes fail 
substantially simultaneously, each node which continues to 
operate does not receive a reconfiguration message from 
each of the failed nodes and the failed nodes are omitted 
from the proposed new cluster. Thus, multiple simultaneous 
failures are processed in a single reconfiguration. Each of the
member nodes of the proposed cluster determines the membership
of the proposed cluster, broadcasts a reconfiguration message
to all proposed member nodes, and collects similar messages.
If all reconfiguration messages agree, the
proposed cluster is accepted. In the case in which one or 
more nodes leave the cluster, quorum is established in the 
new cluster relative to the old cluster. 

45 Claims, 10 Drawing Sheets 
SYSTEM AND METHOD FOR MODIFYING
MEMBERSHIP IN A CLUSTERED
DISTRIBUTED COMPUTER SYSTEM AND
UPDATING SYSTEM CONFIGURATION

FIELD OF THE INVENTION

The present invention relates to distributed computer 
systems and, in particular, to a particularly efficient mecha- 
nism by which membership in the distributed computer 
system can be determined in the presence of computer
system failures. 

BACKGROUND OF THE INVENTION 

Distributed computer systems rival and even surpass the
processing capabilities of supercomputers which represented
the state of the art even just a few years ago.
Distributed computer systems achieve such processing 
capacity by dividing tasks into smaller components and 
distributing those components to member computers of the 
distributed computer system, each of which processes a 
respective component of the task while other member com- 
puters simultaneously process other components of the task. 
Larger distributed computer systems promise ever increas- 
ing processing capacity at ever decreasing cost. 

While distributed computer systems provide excellent
processing capacity, such systems are particularly susceptible
to computer hardware and software failures. Distributed
computer systems have multiple computers with multiple,
redundant components such as processors, memory and storage
devices, and system software and further include
communications media connecting the multiple member computers
of the distributed computer system. Failure of any of the many
constituent components of the distributed computer system can
result in unavailability of the distributed computer system.
Accordingly, a very important component of any distributed
computer system is the ability of the system to tolerate
individual or multiple, simultaneous faults. Such fault
tolerance of a distributed computer system makes such a system
more reliable than most single computers. Specifically,
failure of a substantial portion of the distributed computer
system is tolerated and processing by the distributed computer
system, while diminished in capacity, continues.

In general, distributed computer systems must meet a number
of criteria to properly tolerate faults and to function
adequately. First, all constituent computers of the
distributed computer system, which are sometimes referred to
as "nodes," must agree regarding which of the nodes are
members of a cluster. A cluster is generally a number of nodes
of a distributed computer system which collectively cooperate
to perform distributed processing. If nodes of a distributed
computer system disagree as to the membership of the cluster,
nodes can also disagree as to which nodes have a quorum and
therefore have access to shared resources and data. The
likelihood of simultaneous, inconsistent access of the shared
resources and data, and therefore corruption of the data, is
great. Second, no single-point failure within a cluster can
result in complete unavailability of the cluster. Such
susceptibility to failure is generally unacceptable. Third,
nodes of a cluster which has a quorum are never in
disagreement regarding the state of the cluster. A cluster
which has a quorum has exclusive access to resources which the
nodes of the cluster would otherwise share with other nodes of
the distributed computer system. And fourth, isolated or
faulty nodes of a cluster must be removed from the cluster in
a finite period of time, e.g., one minute.
Some currently available distributed computer systems
can tolerate at most one failure of any node or
communications link of the system at one time and can tolerate
consecutive failure of every node but one. The ability to
tolerate multiple, simultaneous failures in a distributed com- 
puter system greatly improves the reliability of such a 
distributed computer system. 

SUMMARY OF THE INVENTION 

In accordance with the present invention, multiple nodes 
can join a cluster simultaneously. Specifically, one or more 
nodes petitioning to join the cluster each determine to which 
nodes of the distributed computer system the nodes are 
connected, i.e., which are sending messages which the 
petitioning nodes receive, regardless of membership of each 
such node in the current cluster. Each petitioning node sends
a reconfigure message proposing a new cluster which
includes as members all nodes to which the petitioning node
is connected.

The proposed cluster can include as members nodes 
which are connected to the petitioning node and which are 
not members of the current cluster. Accordingly, more than 
one node can join the cluster in a single reconfiguration, 
thereby reducing the number of times a cluster must be 
reconfigured when multiple nodes are ready to join the 
cluster substantially simultaneously. Such is possible if 
multiple nodes are unavailable due to failure of a single 
communications link which is subsequently revived. Each 
node receiving the reconfigure message, referred to as a
petitioned node, similarly determines all other nodes to
which the node is connected and responds with a reconfigure
message which proposes a respective new cluster including
all such nodes. The petitioning and petitioned nodes collect
all reconfiguration messages and, if all the reconfiguration
messages unanimously propose the same proposed cluster,
the proposed cluster is accepted as the new cluster. Thus,
unanimous agreement as to the membership of the cluster is
assured.
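
By way of illustration only, the unanimity rule described above
can be sketched in Python. The function name `proposal_accepted`
and the representation of each reconfigure message as a set of
node identifiers are assumptions made for this sketch; the
specification does not prescribe a message format.

```python
from typing import Dict, FrozenSet, Optional


def proposal_accepted(
    proposals: Dict[int, FrozenSet[int]],
) -> Optional[FrozenSet[int]]:
    """Return the agreed cluster if every message proposes the same one.

    `proposals` maps a sender's node identifier to the membership that
    the sender proposed (hypothetical representation). Returns None when
    the messages disagree or when a proposed member sent no message.
    """
    if not proposals:
        return None
    distinct = set(proposals.values())
    if len(distinct) != 1:
        return None  # at least two nodes propose different clusters
    cluster = next(iter(distinct))
    # Every proposed member must itself have sent a matching message.
    if set(proposals) != set(cluster):
        return None
    return cluster
```

A single dissenting or missing message causes the sketch to
return None, which corresponds to the proposed cluster not being
accepted.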

Further in accordance with the present invention, multiple 
nodes can leave a cluster simultaneously. Failure to receive 
messages from a particular node in a predetermined period 
is detected as a failure of the node. In response to the 
detected failure, the node detecting the failure sends a 
reconfigure message. Each node receiving the reconfigure 
message broadcasts in response thereto a reconfigure mes- 
sage to all nodes and determines from which nodes a 
reconfigure message is received. Thus, each node deter- 
mines to which other nodes the node is operatively con- 
nected and configures a proposed new cluster which 
includes as members the connected nodes. If multiple nodes 
fail substantially simultaneously, each node which continues 
to operate does not receive any messages from each of the 
failed nodes and the failed nodes are omitted from the 
proposed new cluster. Thus, multiple simultaneous failures 
are processed in a single reconfiguration. 
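
The failure detection described above can be pictured with a
short Python sketch. The names `failed_nodes` and
`surviving_cluster`, the use of timestamps in seconds, and the
timeout value used in the usage note are assumptions for
illustration; the specification states only that a predetermined
period is used.

```python
from typing import Dict, Set


def failed_nodes(last_heard: Dict[int, float], now: float,
                 timeout: float) -> Set[int]:
    """Nodes whose most recent message is older than `timeout` seconds."""
    return {node for node, t in last_heard.items() if now - t > timeout}


def surviving_cluster(current: Set[int], failed: Set[int]) -> Set[int]:
    """Proposed new cluster: current members minus the silent nodes."""
    return current - failed
```

For example, with `last_heard = {1: 100.0, 2: 95.0, 3: 60.0, 4: 58.0}`,
`now = 100.0` and a hypothetical timeout of 30 seconds, nodes 3 and 4
are both reported as failed in one pass, so both are omitted from
the same proposed cluster.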

Since the failure of a node can be either a failure of the
node itself or of the communications link connecting the node
to the remainder of the distributed computer system, the
proposed new cluster is not accepted as the new cluster
unless the proposed new cluster can establish a quorum
relative to the previous membership of the cluster. If the
previous cluster had only two member nodes, quorum is
established by a race mechanism. If the two member nodes of
the previous cluster do not share a quorum device, an
alternative mechanism is used to establish quorum. If the
previous cluster had more than two member nodes, quorum is
established by a vote mechanism in which one of the member
nodes of the previous cluster is designated the crown prince
to resolve quorum votes which result in a tie.
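
As a hedged illustration of the vote mechanism for a previous
cluster of more than two nodes, the following Python sketch
assumes a simple majority rule in which an exact tie is resolved
in favor of the side holding the crown prince. The function name
and the precise tie rule are assumptions for this sketch; the
summary above states only that the crown prince resolves tied
votes.

```python
from typing import Set


def vote_has_quorum(previous: Set[int], proposed: Set[int],
                    crown_prince: int) -> bool:
    """Decide quorum relative to the previous cluster membership."""
    survivors = previous & proposed
    if 2 * len(survivors) > len(previous):
        return True  # strict majority of the previous cluster
    if 2 * len(survivors) == len(previous):
        return crown_prince in survivors  # tie resolved by crown prince
    return False  # fewer than half of the previous members remain
```

Because quorum is measured against the previous membership only,
two disjoint halves of a partitioned cluster cannot both obtain
quorum under this assumed rule.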

Accordingly, a distributed computer system in accordance
with the present invention can tolerate simultaneous failure
of up to one-half of the member nodes of a cluster. Failure
of more than one-half of the member nodes of the cluster
prevents the cluster from achieving a quorum. However,
since quorum is established relative to the previous
membership of the cluster and not relative to all nodes of the
distributed computer system, the distributed computer system
can tolerate a series of multiple-node failures as long as
each multiple-node failure includes failure of no more than
one-half of the nodes surviving the previous multiple-node
failure until only one node remains operative. The
distributed computer system according to the present invention
is therefore particularly robust and improves significantly
the likelihood that the functionality provided by the
distributed computer system will continue to be provided
despite multiple simultaneous, or a series of multiple
simultaneous, failures.
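
The series of tolerated failures can be worked through
numerically. The following Python sketch, in which the function
name `worst_case_sizes` is a hypothetical label, assumes that
each successive failure removes the largest tolerated number of
nodes, i.e., one-half of the surviving nodes rounded down.

```python
from typing import List


def worst_case_sizes(n: int) -> List[int]:
    """Cluster sizes after successive maximal tolerated failures."""
    sizes = [n]
    while n > 1:
        n -= n // 2  # lose at most half of the current survivors
        sizes.append(n)
    return sizes


print(worst_case_sizes(6))  # [6, 3, 2, 1]
```

Under this assumption a six-node cluster can shrink from six
nodes to three, then to two, then to one, with quorum
re-established relative to the previous membership at each stage.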

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a distributed computer
system in accordance with the present invention.

FIG. 2 is a block diagram of two nodes of the distributed
computer system of FIG. 1 which share a number of devices
and each of which includes a cluster membership monitor in
accordance with the present invention.

FIG. 3 is a block diagram of a cluster membership
monitor of FIG. 2.

FIG. 4 is a logic flow diagram illustrating the petitioning
of a node to join a cluster in the distributed computer system
of FIG. 1 in accordance with the present invention.

FIG. 5 is a logic flow diagram illustrating the processing
of nodes in response to the petitioning shown in FIG. 4 to
determine membership in a new cluster in accordance with
the present invention.

FIG. 6 is a logic flow diagram illustrating the leaving of
a node from a cluster in accordance with the present
invention.

FIG. 7 is a logic flow diagram illustrating negotiation for
quorum based on the previously current cluster membership
in accordance with the present invention.

FIG. 8 is a logic flow diagram illustrating a race for
quorum in response to one node leaving a cluster having two
member nodes.

FIG. 9 is a logic flow diagram illustrating a vote for
quorum in response to one or more nodes leaving a cluster
having more than two member nodes.

FIG. 10 is a block diagram showing individual threads of
the cluster membership monitor of FIG. 3 according to one
embodiment.

DETAILED DESCRIPTION 

In accordance with the present invention, membership in 
a cluster of nodes in the distributed computer system is 
determined in a way which permits multiple nodes to
simultaneously join or leave the cluster. As a result, the
distributed computer system continues to provide service in 
spite of multiple simultaneous node failures. 

FIG. 1 shows an illustrative example of a distributed
computer system 100 which includes nodes 0-5. Nodes 0-5
are fully interconnected, i.e., distributed computer system
100 includes a direct communications link between each of
nodes 0-5 and each other of nodes 0-5. Distributed computer
system 100 also includes a number of storage devices
102A-F, each of which serves as a quorum device in one
embodiment. Storage device 102A is connected between and
shared by nodes 0 and 3. Storage device 102B is connected
between and shared by nodes 3 and 5. Storage device 102C
is connected between and shared by nodes 5 and 1. Storage
device 102D is connected between and shared by nodes 1
and 4. Storage device 102E is connected between and shared
by nodes 4 and 2. Storage device 102F is connected between
and shared by nodes 2 and 0. Nodes 0-5 are described in
greater detail below.
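
For reference, the sharing of storage devices 102A-F among nodes
0-5 can be written as a lookup table. The Python names
`QUORUM_DEVICES` and `shared_quorum_device` below are
hypothetical and serve only to illustrate the topology of FIG. 1
as described above.

```python
from typing import Dict, FrozenSet, Optional

# Storage devices of FIG. 1 and the pair of nodes sharing each one.
QUORUM_DEVICES: Dict[str, FrozenSet[int]] = {
    "102A": frozenset({0, 3}),
    "102B": frozenset({3, 5}),
    "102C": frozenset({5, 1}),
    "102D": frozenset({1, 4}),
    "102E": frozenset({4, 2}),
    "102F": frozenset({2, 0}),
}


def shared_quorum_device(a: int, b: int) -> Optional[str]:
    """Return the device shared by nodes a and b, or None if there is none."""
    pair = frozenset({a, b})
    for name, nodes in QUORUM_DEVICES.items():
        if nodes == pair:
            return name
    return None
```

In this arrangement each node is attached to exactly two of the
six devices, and some pairs of nodes, e.g., nodes 0 and 1, share
no device at all.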

Cluster membership is determined by each of nodes 0-5 
individually in such a manner that each node arrives at the 
same result and multiple, simultaneous failures are detected 
and properly handled. Each of nodes 0-5 includes a cluster 
membership monitor (CMM) which is a computer process 
executing within each of nodes 0-5. To facilitate apprecia- 
tion of the present invention, a number of hardware com- 
ponents of nodes 0-5, and therefore the operating environ- 
ment for each of the CMMs, are described. 

FIG. 2 shows nodes 0 and 3. Each of nodes 1-5, including
node 3, is directly analogous to node 0 and the following
description of node 0 is equally applicable to each of nodes
1-5. Node 0 includes one or more processors 202A, each of
which retrieves computer instructions from memory 204A
through an interconnect 206A and executes the retrieved
computer instructions. In executing retrieved computer
instructions, each of processors 202A can retrieve data from
and write data to memory 204A and any and all of shared
storage devices 102A and 212A-C through interconnect
206A. Interconnect 206A can be generally any interconnect
mechanism for computer system components and can be,
e.g., a bus, a crossbar, a mesh, a torus, or a hypercube.
Memory 204A can include any type of computer memory
including, without limitation, randomly accessible memory
(RAM), read-only memory (ROM), and storage devices
which use magnetic and/or optical storage media such as
magnetic and/or optical disks. Shared storage devices 102A
and 212A-C are each a storage device or an array of storage
devices which can be simultaneously coupled to two or more
computers. As shown in FIG. 2, each of shared storage
devices 102A and 212A-C is coupled both to interconnect
206A of node 0 and to interconnect 206B of node 3. Each of
shared storage devices 102A and 212A-C is accessed by
each of nodes 0 and 3 as a single device although each of
shared storage devices 102A and 212A-C can be an array of
storage devices. For example, any of shared storage devices
102A and 212A-C can be a SPARC Storage Array available
from Sun Microsystems, Inc. of Mountain View, Calif.

Sun, Sun Microsystems, and the Sun Logo are trademarks 
or registered trademarks of Sun Microsystems, Inc. in the 
United States and other countries. All SPARC trademarks 
are used under license and are trademarks of SPARC 
International, Inc. in the United States and other countries. 
Products bearing SPARC trademarks are based upon an 
architecture developed by Sun Microsystems, Inc. 

Each of shared storage devices 102A and 212A-C can be
reserved by either node 0 or node 3. For example, any of
processors 202A can issue control signals through
interconnect 206A to shared storage device 212C which cause
reservation of storage device 212C. In response to the
control signals, shared storage device 212C determines
whether shared storage device 212C is already reserved as
represented in the physical state of shared storage device
212C, e.g., in the state of a flag or an identification of the
holder of the current reservation as represented in a register
of shared storage device 212C. If shared storage device
212C is not currently reserved, shared storage device 212C
changes its physical state to indicate that shared storage
device 212C is now reserved by node 0. Conversely, if
shared storage device 212C is currently reserved, shared
storage device 212C sends through interconnect 206A to
processors 202A signals which indicate that the attempted
reservation is refused.

In addition, each of processors 202A can issue control
signals to a network access device 208A which cause
network access device 208A to transfer data through
network 210 between network access device 208A of node 0
and network access device 208B of node 3 in a conventional
manner. Network 210 includes all of the communications
links between nodes 0-5 shown in FIGS. 1 and 2. In one
embodiment, network 210 (FIG. 2) is the well-known
Ethernet network and network access devices 208A and 208B
are conventional Ethernet controller circuitry.

Node 0 includes a cluster membership monitor (CMM)
220A which is a computer process executing in processors
202A from memory 204A. CMM 220A implements a state
automaton which includes representation of the state of node
0 with respect to distributed computer system 100 (FIG. 1)
and of a current cluster of distributed computer system 100.
CMM 220A is shown in greater detail in FIG. 3 and includes
a number of fields which collectively represent the state of
a cluster of nodes 0-5 (FIG. 1). A field is data which
collectively represent a component of information.
Specifically, CMM 220A (FIG. 3) includes an identification
field 302, a cluster size field 304, a cluster vector field 306,
a next cluster size field 308, and a next cluster vector field
310.

Identification field 302 includes data which uniquely
identifies node 0 and distinguishes node 0 from nodes 1-5
(FIG. 1). The data stored in identification field 302 (FIG. 3)
are sometimes collectively referred to herein as the identifier
of node 0. Cluster size field 304 (FIG. 3) includes data which
specify a number of nodes included in the cluster to which
node 0 is a member. Cluster vector field 306 includes data
which identify each member node of the cluster to which
node 0 is a member. Accordingly, cluster vector field 306
includes the identifier of node 0 and can include the
identifiers of each of nodes 1-4 (FIG. 1). Next cluster size field
308 and next cluster vector field 310 collectively represent
a state of a prospective cluster during reconfiguration as
described below and are analogous to cluster size field 304
and cluster vector field 306, respectively.

When CMM 220A (FIG. 2) of node 0 is initialized, CMM
220A attempts to join a cluster which includes any of nodes
1-5 (FIG. 1) according to the steps of logic flow diagram
400 (FIG. 4). Processing according to logic flow diagram
400 begins in step 402 in which CMM 220A (FIG. 3)
initializes cluster size field 304 to zero and cluster vector
field 306 to represent an empty set to indicate that no nodes
are currently a member of the current cluster. Processing
transfers to step 404 (FIG. 4) in which CMM 220A (FIG. 3)
broadcasts a reconfiguration message to nodes 1-5. A
reconfiguration signal generally includes a message type field
which indicates that the message is a reconfiguration
message and includes the identifier of the node sending the
reconfiguration message and the cluster size and vector
fields of the node sending the reconfiguration message.
CMM 220A broadcasts the reconfiguration message to all
nodes which are potentially members of a new cluster, i.e.,
to nodes 1-5 (FIG. 1), regardless of each node's membership
in any current clusters.

In step 406 (FIG. 4), to which processing transfers from
step 404, CMM 220A (FIG. 3) waits for a predetermined
period of time to receive reconfiguration messages from
nodes 1-5. In one embodiment, the predetermined period of
time is thirty seconds. As described in more detail below
with respect to step 504 (FIG. 5), each member node of a
cluster responds to a reconfiguration message received from
a non-member node by broadcasting a responding
reconfiguration message. By waiting to receive reconfiguration
messages from all nodes, CMM 220A (FIG. 3) determines
which, if any, of nodes 1-5 are operative and in
communication with node 0. When CMM 220A has received
reconfiguration messages from all of nodes 1-5 or when the
predetermined period of time has expired, whichever occurs
first, processing transfers to step 408 (FIG. 4). In step 408,
CMM 220A (FIG. 3) updates next cluster size field 308 and
next cluster vector 310 to represent a cluster which includes
node 0 and all nodes from which CMM 220A receives a
reconfiguration message in step 406 (FIG. 4). Thus, in the
preceding steps, CMM 220A (FIG. 3) builds a prospective
cluster which includes all nodes which appear to be
operative and connected to node 0.

It should be noted at this point that multiple nodes can
join the cluster in a single reconfiguration. For example,
node 2 (FIG. 1) can perform the steps of logic flow diagram
400 (FIG. 4) while node 0 performs the steps of logic flow
diagram 400 concurrently and independently. Accordingly,
reconfiguration messages broadcast by nodes 0 and 2 in
independent, analogous performances of step 404 (FIG. 4)
are received by nodes 0 (FIG. 1) and 2 in independent,
analogous performances of step 406 (FIG. 4). Accordingly,
nodes 0 and 2 include each other in a prospective new cluster
in independent, analogous performances of step 408 (FIG.
4).

In test steps 410 (FIG. 4) and 412, CMM 220A (FIG. 3)
determines whether the prospective cluster is proper.
Specifically, in test step 410 (FIG. 4), CMM 220A (FIG. 3)
compares the cluster size represented in cluster size field 304
to a value of one to determine whether any node other than
node 0 is a member of the prospective cluster. If the cluster
size is greater than one, processing transfers to step 414
(FIG. 4) which is described below. Conversely, if the cluster
size is not greater than one, processing transfers to test step
412.

In test step 412, CMM 220A (FIG. 3) determines whether
node 0 is isolated, i.e., whether all communications links
between node 0 and other nodes of distributed computer
system 100 (FIG. 1) have failed. If node 0 is not isolated but
is instead the sole member of a cluster, node 0 can safely
participate in competitions for quorum, which are described
more completely below, and other nodes can subsequently
join the cluster of which node 0 is the sole member. It is
generally preferred to prevent isolated nodes from operating
on shared data since such presents a substantial risk that such
data will become corrupted by the isolated node or other
nodes which are not in communication with the isolated
node. However, a node which is the sole member of a cluster
is permitted to continue processing.

From the perspective of CMM 220A (FIG. 2) of node 0,
isolation of node 0 and exclusive membership in a
single-node cluster are indistinguishable. In one embodiment, the
determination regarding whether node 0 is isolated requires
human intervention. A human operator generally provides
data, through physical manipulation of user input devices
(not shown) of node 0 using conventional techniques, which
indicates whether node 0 is isolated. The data can be
provided beforehand and stored in a node configuration
field (not shown) from which CMM 220A (FIG. 3) retrieves
the data. Alternatively, the operator can be prompted to
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provide the data by CMM 22 OA using conventional user- burden on distributed computer systems with a large number 

interface techniques. If node 0 is isolated, processing trans- of nodes. However, reconfiguration of a cluster of the nodes 

fers from test step 412 (FIG. 4) to step 420 in which node should be a relatively infrequent occurrence since each node 

0 fails to join a cluster and CMM 220A (FIG. 3) aborts and each communications link is preferably relatively stable 

processing in the manner described more completely below. s ^ reliable 

Conversely if node 0 is noi I isolated node 0 proceeds to Logic flow diagram 500 (FIG. S) illustrates the processing 

form a cluster to which node 0 is the sole member and r & , ' • . c & 

processing transfers to step 414 (FIG. 4). of ,. a P eUUoncd n ° dc m res P 0D f t0 ""f of a reconfi f' 

In step 414, CMM 220A (FIG. 3) requests a reconfigu- rah ° n meSSa 8 e ^ en 00 rec ° nfi S uiatl0n 15 " P r °f 

; . . » j . . j . r nodes receiving the reconfiguration message perform the 

ration of the current cluster of distributed computer system 1fl . ri - <f t nn n J! „ „ *i a 

+ m>i^ ^\ . . . ii J 10 steps of logic now diagram 500 generally concurrently and 

100 (FIG. 1) by broadcasting a reconfiguration message . / , ° tl » j .i * * „ c , n e 

. . i . , . . • i . ■ * _ independently. As described above, all of nodes 0-5 are 

which includes the prospective cluster size and vector rep- r „ , , t . ™ f f „ 

, . / r . „ . , - x . r 4 generally analogous to one another. Therefore, the following 

resented in next cluster size field 308 (FIG. 3) and next 5 * *• n ■ a j- mn- *u ' . . r ^ 

i * * c ,j nn ruxi -»^aa u j * ( . desenpuon of logic flow diagram 500 in the context of node 

cluster vector field 310. CMM 220A broadcasts the recon- A • n ? -i * r » n 

£ . 4 , , , _ , . , 0 is equally applicable to performance or the steps of logic 

figuration message to each of nodes 1-5 which is identified 1( r « 7. ' *nnu *u j j « * 

. & . & , .i . r i • « 15 flow diagram 500 by any other petitioned ones of nodes 0-5 

in next cluster vector 310. In the context of logic now ,„ T ~ T 4 , * / ri a j- cft A j ft 

» n a y n , x . . r (FI G. 1 ) . In the context of logic flow diagram 500, node 0 

diagram 400 (FIG. 4), each such node ts referred to as a V. ^ node ^ , he * ne of nodes 1-5 which sends 

petitioned node. Processing transfers to step 416 (FIG. 4) in ^ <• j 1 • r 

u- un^K^i^ftA/cir if f j # • j • j the reconfiguration message, e.g., node 1, is referred to as 
which CMM 220A(FIG. 3) waits for a predetermmed period ^ ^ node p^ssiZg according to logic flow 
of time to rece.ve reconfiguration messages from all peti- M ^ ^ ^ ^ ^ ste ^ 5) 
tioned nodes. In one embodiment, the predetermined period & ° v 
of time is thirty seconds. The manner by which a petitioned In ste P 502 > 220A ( FIG 3 ) of te Phoned node, 
node receives a reconfiguration message from CMM 220A e S > node °> receives the reconfiguration message from the 
of node 0 and replies with another reconfiguration message petitioning node. CMM 220A ascertains that the reconfigu- 
is described more completely below with respect to logic „ ratl0n message is a petition to join the current cluster by 
flow diagram 500 (FIG. 5). When CMM 220A (FIG. 3) determining that the petitioning node, i.e., the source of the 
receives a reconfiguration message from each petitioned reconfiguration message, is not a member of the current 
node or when the predetermined period of time expires, cluster - As described above, multiple nodes can petition for 
which ever occurs first, processing of CMM 220A transfers membership in the cluster m a single reconfiguration, 
to test step 418 (FIG 4) Accordingly, the petitioned node can receive more than one 
In test step 418, CMM 220A(FIG. 3) determines whether 3 ° reconfiguration message in step 502. For simphfication of 
reconfiguration messages have been received from all peti- ^ following description, it is assumed that only a single 
tioned nodes. If CMM 220A fails in step 416 to receive a node 1S ^uoning for membership in the cluster, 
reconfiguration message from any of the petitioned nodes, Processing transfers to step 504 (FIG. 5) in which CMM 
processing transfers from test step 418 to step 420 and the 35 220A ( FIG - 3 ) broadcasts a reconfiguration message to all 
reconfiguration fails. In step 420, CMM 220A (FIG. 3) prospective members of a prospective cluster, which 
aborts processing and does not update cluster size field 304 includes all members of the current cluster and the petition- 
and cluster vector field 306 to represent the prospective in S nodc - B y broadcasting the reconfiguration message, 
cluster. After step 420, processing according to logic flow CMM 220A notifies all prospective members of the pro- 
diagram 400 (FIG. 4) terminates. 40 spective cluster that node 0 is operational and connected. 

Conversely, if CMM 220A(FIG. 3) determines in test step In step 506 (FIG. 5), to which processing transfers from 
418 (FIG. 4) that reconfiguration messages from all peti- step 504, CMM 220A (FIG, 3) waits for a predetermined 
tioned nodes are received in step 416, processing transfers to period of time to receive reconfiguration messages from all 
test step 422. In test step 422, CMM 220A (FIG. 3) compares prospective members of the prospective cluster excluding 
the received reconfiguration messages to determine whether 45 m « petitioning node since a reconfiguration message was 
all the received reconfiguration messages represent exactly previously received by the petitioned node, e.g., node 0, in 
the same cluster, i.e., whether all received reconfiguration step 502 (FIG. 5). Specifically, reconfiguration messages 
messages agree as to cluster membership in the prospective received in step 506 include reconfiguration messages 
cluster. If any of the received reconfiguration messages do broadcast by other petitioned nodes in analogous, indepen- 
not agree as to cluster membership, processing transfers 50 dent performances of step 504. In one embodiment, the 
from test step 422 (FIG. 4) to step 420 in which the predetermined period of time is thirty seconds, 
reconfiguration of the cluster fails in the manner described When reconfiguration messages have been received from 
above. Conversely, if all received reconfiguration messages all prospective members of the prospective cluster have been 
agree as to membership in the prospective cluster, process- received by CMM 220A(FIG. 3) or when the predetermined 
ing transfers from test step 422 to step 424. In step 424, the 55 time period expires, whichever occurs first, processing trans- 
prospective cluster is accepted and node 0 saves the pro- fers to step 508 (FIG. 5). In step 508, CMM 220A (FIG. 3) 
spective cluster as the current cluster by copying data from stores in next cluster size field 308 and next cluster vector 
next cluster size field 308 (FIG. 3) and next cluster vector field 310 data which represents a cluster whose membership 
field 310 to cluster size field 304 and cluster vector field 306, includes all nodes from which reconfiguration messages are 
respectively. After step 424 (FIG. 4), processing according 60 received in step 506 (FIG. 5), including the petitioned node, 
to logic flow diagram 400 terminates. e.g., node 0. Accordingly, next cluster size field 308 (FIG. 3) 
Thus, a new cluster configuration is negotiated by broad- a nd next cluster vector field 310 store data representing a 
casting a reconfiguration message to all available node over prospective cluster which includes as members all nodes 
all available communications links and receiving confirma- which are operational and which are in communication with 
tion from each petitioned node. It is noted that broadcasting 65 toe petitioned node. 

reconfiguration messages to all available nodes over all It is important to note that, since CMM 220A determines 

available communications links imposes a relatively heavy which of the nodes of the cluster are connected and func- 
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lioning in steps 502 (FIG. 5) and 506 and forms the 
prospective new cluster from these nodes in step 508, 
multiple nodes can be added to the cluster simultaneously. 
Thus, in steps 404-408 (FIG. 4) and 502-508 (FIG. 5), all 
member nodes of a prospective new cluster determine
independently which other nodes are operative and in com- 
munication with the member nodes to thereby ascertain 
membership of the new, prospective cluster. Accordingly, 
multiple nodes can join the cluster simultaneously. It should 
also be noted that one or more nodes which fail to respond
with reconfiguration messages which are therefore not 
received in independent performances of step 406 (FIG. 4) 
or step 506 (FIG. 5) by each member of the prospective 
cluster are excluded from membership in the prospective 
cluster. Accordingly, a node can join the cluster while 
another node leaves the cluster in a single reconfiguration of
the cluster. 
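
The way a single reconfiguration can absorb a join and a departure at once can be illustrated with a minimal Python sketch. It is not part of the described embodiment: the names `ProspectiveCluster` and `form_prospective_cluster` are hypothetical, and the sketch assumes that a node simply counts itself together with every node it heard from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProspectiveCluster:
    size: int            # plays the role of next cluster size field 308
    members: frozenset   # plays the role of next cluster vector field 310


def form_prospective_cluster(self_id, responders):
    """Build the prospective cluster from the nodes heard from, plus this node."""
    members = frozenset(responders) | {self_id}
    return ProspectiveCluster(size=len(members), members=members)


# Current cluster is {0, 1, 2}. Node 2 has gone silent and node 3 is petitioning.
# Node 0 therefore hears from nodes 1 and 3 only.
proposal = form_prospective_cluster(self_id=0, responders={1, 3})
print(sorted(proposal.members))  # [0, 1, 3]
```

Because membership is derived only from which messages arrived, node 3 is added and node 2 is dropped in the same pass, with no separate handling for either case.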

Steps 510-520 are generally analogous to steps 414-424
(FIG. 4) in that the petitioned nodes each determine whether
all other members of the prospective cluster are in unanimous
agreement with respect to the membership of the
prospective cluster. Specifically, processing transfers from
step 508 (FIG. 5) to step 510 in which CMM 220A (FIG. 3)
broadcasts to all members of the prospective cluster a
reconfiguration message which includes data specifying the
prospective cluster, i.e., specifying the number and identity
of the members of the prospective cluster. In step 512 (FIG.
5), CMM 220A (FIG. 3) waits for a predetermined period of
time to receive reconfiguration messages from all members
of the prospective cluster. In one embodiment, the predetermined
period of time is thirty seconds. When reconfiguration
messages are received from all members of the
prospective cluster or when the predetermined period of
time expires, whichever occurs first, processing transfers to
test step 514 (FIG. 5) in which CMM 220A (FIG. 3) begins
to determine whether the members of the prospective cluster
unanimously agree to the prospective cluster's membership.

In test step 514, CMM 220A (FIG. 3) determines whether
a reconfiguration message is received from every member of
the prospective cluster in step 512 (FIG. 5). If CMM 220A
(FIG. 3) fails to receive a reconfiguration message from any
of the members of the prospective cluster during the predetermined
time period in step 512 (FIG. 5), processing
transfers from test step 514 to step 516. In step 516, the
petitioning node is refused membership in the cluster and the
cluster remains unchanged, i.e., data stored in next cluster
size field 308 (FIG. 3) and next cluster vector field 310 are
not moved into cluster size field 304 and cluster vector field
306. After step 516 (FIG. 5), processing according to logic
flow diagram 500 terminates.

Conversely, if CMM 220A (FIG. 3) determines in test step
514 (FIG. 5) that reconfiguration messages are received
from all members of the prospective cluster, processing
transfers from test step 514 to test step 518. In test step 518,
CMM 220A (FIG. 3) compares all received reconfiguration
messages to determine whether the received reconfiguration
messages specify the same cluster specified by the reconfiguration
message sent by CMM 220A in step 510 (FIG. 5).
If any of the reconfiguration messages specifies a different
cluster, agreement regarding new cluster membership is not
unanimous and processing transfers to step 516 in which the
petitioning node is refused membership in the cluster in the
manner described above. Conversely, if all reconfiguration
messages specify the same cluster, agreement regarding new
cluster membership is unanimous and processing transfers
from test step 518 to step 520.

In step 520, the petitioning node is granted membership in 
the cluster and the prospective cluster is made current by 
copying data stored in next cluster size field 308 (FIG. 3) and
next cluster vector field 310 into cluster size field 304 and 
cluster vector field 306, respectively. After step 520 (FIG. 5), 
processing according to logic flow diagram 500 terminates. 
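
The unanimity test of steps 514-520 can be sketched in Python as follows. This is an illustrative sketch only: the class `CmmState`, the method `try_commit`, and the representation of a reconfiguration message as a mapping from sender to proposed membership are assumptions, as is the choice to treat the node's own proposal as already known rather than received.

```python
from dataclasses import dataclass


@dataclass
class CmmState:
    node_id: int                                  # identification field 302
    cluster_size: int = 0                         # cluster size field 304
    cluster_vector: frozenset = frozenset()       # cluster vector field 306
    next_cluster_size: int = 0                    # next cluster size field 308
    next_cluster_vector: frozenset = frozenset()  # next cluster vector field 310

    def try_commit(self, received):
        """received maps a sender id to the membership set that sender proposed."""
        others = self.next_cluster_vector - {self.node_id}
        # Test step 514: every other prospective member must have answered.
        if not others.issubset(received.keys()):
            return False  # step 516: cluster remains unchanged
        # Test step 518: every answer must name exactly the same cluster.
        if any(frozenset(received[n]) != self.next_cluster_vector for n in others):
            return False  # step 516: cluster remains unchanged
        # Step 520: the prospective cluster becomes the current cluster.
        self.cluster_size = self.next_cluster_size
        self.cluster_vector = self.next_cluster_vector
        return True


state = CmmState(node_id=0, cluster_size=2, cluster_vector=frozenset({0, 1}),
                 next_cluster_size=3, next_cluster_vector=frozenset({0, 1, 3}))
agreed = state.try_commit({1: {0, 1, 3}, 3: {0, 1, 3}})
```

Keeping the prospective fields separate from the current fields means a refused petition needs no rollback: the current fields are simply never overwritten.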
Leaving a Cluster 

Once a cluster is established, the nodes of the cluster 
cooperate to distribute processing and carry the distributed 
processing in a conventional manner and to thereby achieve 
the efficiencies and benefits associated with distributed 
processing. On occasion, it is necessary for one or more 
nodes to leave the cluster. For example, a node may deter- 
mine that the node can no longer guarantee accurate pro- 
cessing and can voluntarily withdraw from the cluster. 
Alternatively, a node can fail and that failure can be detected 
by another node of the cluster to whom the failing node had 
been sending reconfiguration messages. It should be noted 
that failure of all communication links between two nodes is 
detected in the same manner and is therefore processed in 
the same manner as if the node itself had failed. The node 
detecting the failure initiates a reconfiguration of the cluster 
to form a new cluster which does not include any failed 
nodes. In either case, a node broadcasts a reconfiguration 
message to all nodes of the cluster. 

Removal of a node from the cluster in response to such a 
reconfiguration message is illustrated by logic flow diagram 
600 (FIG. 6) in which processing begins in step 602. All 
nodes of the cluster perform the steps of logic flow diagram 
600 generally concurrently and independently. As described 
above, all of nodes 0-5 are generally analogous to one 
another. Therefore, the following description of logic flow 
diagram 600 in the context of node 0 is equally applicable 
to performance of the steps of logic flow diagram 600 by any 
other one of nodes 0-5 (FIG. 1). 

In step 602 (FIG. 6), CMM 220A (FIG. 3) of node 0
receives the reconfiguration message. Processing transfers to 
step 604 (FIG. 6), in which CMM 220A (FIG. 3) broadcasts
a reconfiguration message to all nodes in the current cluster, 
i.e., all nodes identified in cluster vector field 306. CMM 
220A of node 0 then waits for a predetermined amount of 
time to receive reconfiguration messages from all nodes in 
the current cluster in step 606 (FIG. 6) to determine which 
of the nodes of the cluster are in communication with node 
0 and operational. In one embodiment, the predetermined 
period of time is thirty seconds. 

The node leaving the cluster sends no messages after the 
reconfiguration message received in step 602. Accordingly, 
CMM 220A (FIG. 3) does not receive a reconfiguration 
message from the leaving node in step 606. However, 
processing is slightly different if one node detects failure of 
another node and sends a reconfiguration message to form a 
new cluster which excludes the failed node. In such 
circumstances, the former, failure-detecting node sends a 
reconfiguration message in lieu of receiving a reconfigura- 
tion message in step 602 (FIG. 6) but performs steps 
604-612 in the manner otherwise described herein. 
Accordingly, the failure-detecting node broadcasts a second 
reconfiguration message which is received in step 606 such 
that the failure-detecting node is included in the new, 
prospective cluster after expulsion of the failed node. 

When CMM 220A (FIG. 3) receives reconfiguration 
messages from all nodes in the current cluster or when the 
predetermined amount of time passes, whichever occurs 
first, processing transfers to step 608 (FIG. 6). In step 608, 
CMM 220A (FIG. 3) forms a prospective new cluster which
includes all nodes from whom CMM 220A receives recon- 
figuration messages in step 606 (FIG. 6). The prospective 
new cluster is represented in next cluster size field 308 (FIG. 
3) and next cluster vector field 310 of CMM 220A. 
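
The wait in step 606, which ends either when every current member has answered or when the predetermined period expires, can be sketched in Python. The function name `collect_responders` and the use of a `queue.Queue` of sender identifiers as a stand-in for received reconfiguration messages are assumptions made for illustration; the thirty-second default mirrors the embodiment described above.

```python
import queue
import time


def collect_responders(inbox, expected, timeout_s=30.0):
    """Return the expected nodes heard from before the deadline.

    Returns early as soon as every expected node has answered.
    """
    expected = set(expected)
    deadline = time.monotonic() + timeout_s
    heard = set()
    while heard != expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # predetermined period expired
        try:
            sender = inbox.get(timeout=remaining)
        except queue.Empty:
            break  # nothing more arrived before the deadline
        if sender in expected:
            heard.add(sender)
    return heard


inbox = queue.Queue()
for sender in (1, 2, 4):  # nodes 3 and 5 never answer
    inbox.put(sender)
heard = collect_responders(inbox, expected={1, 2, 3, 4, 5}, timeout_s=0.05)
prospective = heard | {0}  # step 608, assuming node 0 counts itself
```

A short timeout is used here only so the example finishes quickly; the silent nodes 3 and 5 are left out of the prospective cluster without any explicit failure report.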
It is important to note that, since CMM 220A determines
which of the nodes of the cluster are operational and in
communication with node 0 in step 606 (FIG. 6) and forms
the prospective new cluster from these nodes, multiple
nodes can be removed from the cluster simultaneously. In
other words, the cluster negotiation mechanism according to
the present invention tolerates multiple, simultaneous
failures.

From step 608 (FIG. 6), processing transfers to step 610
in which CMM 220A (FIG. 3) negotiates a quorum for the
prospective new cluster. Quorum must generally be negotiated
because failure of one or more nodes of a cluster can be
indistinguishable from a failure of communication links
connecting the one or more nodes to the other nodes of the
cluster. If a node leaves the cluster due to failure of the node
itself, the leaving node generally ceases processing and does
not access resources shared with the remainder of the cluster.
However, if a node leaves the cluster due to failure of a
communication link, the node can continue processing and
can corrupt shared resources by failing to coordinate access
with other nodes which continue to operate. It is therefore
important that member nodes of the prospective new cluster
establish a quorum before continuing processing and accessing
resources shared with the leaving node or nodes.

Step 610 (FIG. 6) is shown in greater detail as logic flow
diagram 610 (FIG. 7) in which processing begins in test step
702. In test step 702, CMM 220A (FIG. 3) determines
whether the current cluster, i.e., the cluster from which one
or more nodes are leaving, has more than two member nodes
by comparison of data stored in cluster size record 304 to
data representing a value of two. If the current cluster has no
more than two member nodes, processing transfers from test
step 702 (FIG. 7) to step 704 in which quorum is negotiated
by a race for quorum. Conversely, if the current cluster has
more than two member nodes, processing transfers from test
step 702 to step 706 in which quorum is negotiated by a vote
for quorum. As a result, a two-node cluster negotiates
quorum by a quorum race since voting for quorum can lead
to uncertain or undesirable results in a two-node cluster, and
a cluster with more than two nodes negotiates quorum by a
quorum vote since a race for quorum can lead to less than
optimum conditions in a larger cluster. Determination of
quorum according to each mechanism is described more
completely below. After either step 704 or step 706, processing
according to logic flow diagram 610, and therefore
step 610 (FIG. 6), completes.

From step 610, processing transfers to step 612 in which
CMM 220A (FIG. 3), if CMM 220A determines that the
prospective new cluster has established a quorum, fences off
those former member nodes of the cluster which have not
achieved quorum to prevent further processing by such
nodes. Specifically, CMM 220A reserves all devices shared
with a former member node of the current cluster to prevent
access to the shared devices by such a node. After step 610,
nodes which left the cluster can no longer access devices
shared between such nodes and the member nodes of the
new cluster. It is noted, however, that maliciously configured
nodes can corrupt devices not shared with any member node
of the new cluster.

Quorum by Race

Determining quorum by a quorum race can be difficult if
the two member nodes of the current cluster do not share a
quorum device. Step 704 is shown in greater detail as logic
flow diagram 704 (FIG. 8). Performing a quorum race
according to logic flow diagram 704 can require human
operator intervention.

Processing according to logic flow diagram 704 (FIG. 8)
begins in test step 802. In test step 802, CMM 220A (FIG.
3) determines whether the member nodes of the current
cluster as represented in cluster vector field 306 share a
quorum device. Briefly, a quorum device is a shared device
which can be reserved by any of the nodes which share the
device and is preselected as a device for use in a quorum
race. CMM 220A includes a quorum database 312 which
specifies which devices are quorum devices for respective
pairs of nodes 0-5. By reference to quorum database 312,
CMM 220A determines whether node 0 and the other of the
nodes of the current cluster share a quorum device. If the
nodes of the current cluster share a quorum device, the
quorum race proceeds with step 804 (FIG. 8). Conversely, if
the nodes do not share a quorum device, processing transfers
to step 812 which is described below.

In step 804, CMM 220A (FIG. 3) attempts to reserve the
quorum device shared with the other node of the current
cluster. Such reservation succeeds only if the other node has
not already reserved the quorum device in an analogous
performance of step 804 (FIG. 8). Processing transfers to
test step 806 in which CMM 220A (FIG. 3) determines
whether reservation of the quorum device is successful. If
the quorum device is successfully reserved in step 804 (FIG.
8), processing transfers from test step 806 to step 808 in
which quorum is established and the prospective new cluster
is accepted as new by copying the data stored in next cluster
size field 308 and next cluster vector field 310 into cluster
size field 304 and cluster vector field 306, respectively. After
step 808, processing according to logic flow diagram 704A,
and therefore step 704 (FIG. 7), terminates.

Conversely, if the quorum device is not successfully
reserved in step 804 (FIG. 8), processing transfers from test
step 806 to step 810. In step 810, CMM 220A (FIG. 3) aborts
processing since quorum is not established. After step 810,
processing according to logic flow diagram 704A, and
therefore step 704 (FIG. 7), terminates.

As described above, if CMM 220A (FIG. 3) determines in
test step 802 (FIG. 8) that the member nodes of the current
cluster do not share a quorum device, processing transfers to
step 812. In step 812, a human computer operator selects a
winner node from the member nodes of the current cluster.
CMM 220A (FIG. 3) prompts the human computer operator
to select a winner node from a list of member nodes of the
current cluster. The human computer operator generates
signals identifying the winner node by physical manipulation
of user-input devices using conventional user-interface
techniques.

Processing transfers to test step 814 (FIG. 8) in which
CMM 220A determines whether node 0 is the winner node
selected in step 812. If node 0 is selected as the winner node,
processing transfers to step 808 in which quorum is established
and the prospective new cluster is accepted as current
in the manner described above. Otherwise, if node 0 is not
selected as the winner node, processing transfers to step 810
in which CMM 220A (FIG. 3) aborts processing since
quorum is not established.

Thus, according to logic flow diagram 704 (FIG. 8),
quorum is determined by a simple race when the previously
current cluster includes only two member nodes and the two
member nodes share a quorum device.

Quorum by Vote

In step 706, which is shown in greater detail as logic flow
diagram 706 (FIG. 9), CMM 220A (FIG. 3) of node 0, and
analogous CMMs of the member nodes of the current
cluster, establish quorum by vote. Processing begins in test
step 902 (FIG. 9) in which CMM 220A (FIG. 3) compares
the number of member nodes in the prospective new cluster
as represented in next cluster size record 308 with one-half
the number of member nodes in the current cluster as
represented in cluster size record 304. If the number of 
member nodes in the prospective new cluster is less than 
one-half of the number of member nodes of the current 
cluster, processing transfers to step 904. Otherwise, process- 
ing transfers to test step 906 which is described more 
completely below. In step 904, the proposed new cluster has 
not established a quorum and processing by CMM 220A 
aborts so that resources shared with the leaving node or 
nodes are not corrupted. After step 904, processing accord- 
ing to logic flow diagram 706, and therefore step 706 (FIG. 
7), completes. 

In test step 906, CMM 220A (FIG. 3) compares the 
number of member nodes of the prospective new cluster and 
one-half of the number of member nodes of the current 
cluster and determines whether the prospective new cluster 
includes the crown prince. The crown prince is a selected 
one of the member nodes of the current cluster. In general, 
one of the member nodes of each cluster is designated as the 
crown prince to resolve quorum votes which result in a tie. 
In one embodiment, the member node with the highest 
relative priority is designated the crown prince of the cluster. 
For example, the relative priority of each node can be 
embedded in the node identifier stored, for example, in 
identification field 302 (FIG. 3) of CMM 220A. In an 
illustrative embodiment, each node identifier is a unique 
number and the numerical value of each node identifier 
represents a relative priority, the highest of which in a given 
cluster identifies the crown prince of the cluster. CMM 220A 
determines whether the prospective new cluster includes the 
crown prince of the current cluster by comparison of the 
node identifiers stored in next cluster vector record 310 to 
the node identifier of the crown prince of the current cluster. 
If the number of member nodes of the prospective new 
cluster is equal to one-half the number of member nodes of
the current cluster and the prospective new cluster does not 
include the crown prince of the current cluster, processing 
transfers to step 904 in which quorum is not established by 
the prospective new cluster as described above. 

Conversely, if (i) the number of members of the prospec- 
tive new cluster is greater than one-half of the number of 
member nodes of the current cluster or (ii) (a) the number of 
members of the prospective new cluster is equal to one-half 
of the number of member nodes of the current cluster and (b) 
the prospective new cluster includes the crown prince of the 
current cluster, processing transfers from test step 906 to 
step 908. In step 908, the prospective new cluster has 
established a quorum and CMM 220A (FIG. 3) accepts the 
prospective new cluster as the current cluster by copying the 
data stored in next cluster size field 308 and next cluster 
vector field 310 into cluster size field 304 and cluster vector 
field 306, respectively. After step 908, processing according
to logic flow diagram 706, and therefore step 706, com- 
pletes. 
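
The vote rule of steps 902-908 reduces to a comparison against one-half of the current cluster size, with the crown prince breaking an exact tie. The following Python sketch illustrates this under the illustrative embodiment in which the highest node identifier denotes the crown prince; the function names `crown_prince` and `has_quorum_by_vote` are hypothetical.

```python
def crown_prince(current_members):
    """Highest node identifier wins, per the illustrative embodiment."""
    return max(current_members)


def has_quorum_by_vote(current_members, prospective_members):
    current = set(current_members)
    prospective = set(prospective_members)
    # Compare 2 * |prospective| with |current| to avoid fractional halves.
    doubled = 2 * len(prospective)
    if doubled < len(current):   # fewer than half: step 904, no quorum
        return False
    if doubled > len(current):   # more than half: step 908, quorum
        return True
    # Exactly half: the tie goes to the side that holds the crown prince.
    return crown_prince(current) in prospective


# A four-node cluster splits evenly; node 3 is the crown prince.
print(has_quorum_by_vote({0, 1, 2, 3}, {0, 1}))  # False
print(has_quorum_by_vote({0, 1, 2, 3}, {2, 3}))  # True
```

Because each half of an evenly split cluster evaluates the same rule against the same crown prince, at most one half can conclude that it holds quorum.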

Thus, in step 706, the member nodes of the prospective 
new cluster independently and unanimously negotiate quo- 
rum by vote with one or more nodes which have left the 
cluster. 
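
For comparison with the vote, the choice made in test step 702 and the quorum race of logic flow diagram 704 can be sketched in the same spirit. This Python sketch is only an illustration: `QuorumDevice` is a toy in-memory stand-in for a reservable shared device, and `quorum_mechanism`, `race_for_quorum`, and the `operator_choice` parameter are hypothetical names.

```python
import threading


class QuorumDevice:
    """Toy shared device that at most one node can hold reserved."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def reserve(self, node_id):
        # Succeeds only if no other node has already reserved the device.
        with self._lock:
            if self.owner is None:
                self.owner = node_id
            return self.owner == node_id


def quorum_mechanism(current_cluster_size):
    """Test step 702: race for two or fewer members, vote otherwise."""
    return "vote" if current_cluster_size > 2 else "race"


def race_for_quorum(node_id, device, operator_choice=None):
    """Steps 802-814 from one node's point of view."""
    if device is not None:               # test step 802: a shared device exists
        return device.reserve(node_id)   # steps 804-810
    return operator_choice == node_id    # steps 812-814: operator names the winner


device = QuorumDevice()
first = race_for_quorum(0, device)    # node 0 reserves the device first
second = race_for_quorum(1, device)   # node 1 arrives too late
```

Whichever node reserves the device first establishes quorum; the other node aborts, so the two survivors of a split two-node cluster cannot both continue.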

CMM in a Multi-Threaded Environment 

In one embodiment, CMM 220A (FIG. 3) is a multi- 
threaded computer process. CMM 220A includes a main 
thread 1002 (FIG. 10), one or more sender threads 1004, one 
or more receiver threads 1006, a command reader thread 
1008, a transition thread 1010, a communication timeout 
thread 1012, a keep alive thread 1014, and an abort thread 
1016. Main thread 1002 processes state changes of CMM 
220A as represented in fields 302-312 in the manner
described above and coordinates processing of the other
threads of CMM 220A. Sender threads 1004 and receiver 
threads 1006 send and receive, respectively, reconfiguration 
messages which indicate to CMM 220A which of the other 
nodes are operative and in communication with CMM 220A 
and reconfiguration messages to coordinate reconfiguration 
of the current cluster. Command reader thread 1008 acts as 
a remote procedure calling (RPC) server for applications 
which execute in any member node of the current cluster of 
which CMM 220A is a member. Transition thread 1010
processes reconfiguration of the current cluster in accor- 
dance with logic flow diagrams 400 (FIG. 4), 500 (FIG. 5), 
and 600 (FIG. 6) and uses sender threads 1004 and receiver 
threads 1006 to send and receive, respectively, reconfigura- 
tion messages in the manner described above. Communica- 
tion timeout thread 1012 monitors messages received by 
receiver threads 1006 and detects failure of or, equivalently, 
loss of communication with a member node of the current 
cluster. Communication timeout thread 1012 includes global 
fields 1018 which store data representing respective states of 
the member nodes of the current cluster. Keep alive thread 
1014 generates reconfiguration messages and causes sender 
threads 1004 to send the reconfiguration messages to respec- 
tive member nodes of the current cluster. Abort thread 1016 
is created when processing by CMM 220A is aborted.

In one embodiment, several of the threads of CMM 220A 
are implemented in the kernel of the operating system of 
node 0 to improve performance and to simplify implemen- 
tation. For example, keep alive thread 1014, communication 
timeout thread 1012, and receiver threads 1006 can use 
kernel timeout interrupts to periodically send and receive 
conventional heartbeat messages to periodically indicate 
that node 0 is operational and in communication with each 
of the nodes of the current cluster. 
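
The bookkeeping that a communication timeout thread might perform on heartbeat arrivals can be sketched in Python. This is a simplified, single-threaded illustration rather than the kernel-level implementation described above: the class `CommunicationTimeoutMonitor`, its methods, and the five-second timeout are all assumptions, and timestamps are passed in explicitly so the example is deterministic.

```python
class CommunicationTimeoutMonitor:
    """Tracks the last heartbeat seen from each member node."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        # Every member is treated as having been seen at start-up.
        self.last_seen = {node: now for node in members}

    def record_heartbeat(self, node, now):
        if node in self.last_seen:
            self.last_seen[node] = now

    def silent_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}


monitor = CommunicationTimeoutMonitor(members={1, 2, 3}, timeout_s=5.0, now=0.0)
monitor.record_heartbeat(1, now=4.0)
monitor.record_heartbeat(3, now=4.5)
suspects = monitor.silent_nodes(now=6.0)  # node 2 was last seen at t=0.0
```

A node that appears in `suspects` is indistinguishable, from this node's point of view, from one whose communication links have all failed, which is why either case leads to the same reconfiguration.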
The above description is illustrative only and is not
limiting. The present invention is limited only by the claims 
which follow. 
What is claimed is: 

1. A method for adding a new node to an existing cluster 
of a distributed computer system, wherein the existing
cluster includes a plurality of existing nodes, said method
comprising: 

the new node transmitting a reconfiguration petition mes- 
sage to each of the plurality of existing nodes; 
each of the plurality of existing nodes transmitting exist- 
ing cluster configuration information to the new node 
and to each of the remaining of the plurality of existing 
nodes in response to the reconfiguration petition mes- 
sage; 

each of the plurality of existing nodes transmitting a 
cluster reconfiguration message specifying a proposed 
new cluster to the new node and to each of the 
remaining of the plurality of existing nodes; 
the new node transmitting an additional cluster reconfigu- 
ration message specifying an additional proposed new 
cluster to each of the plurality of existing nodes; and 
each of the plurality of existing nodes and the new node 
each determining whether the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 

2. The method as recited in claim 1 further comprising: 
each of the plurality of existing nodes and the new node
each storing information which defines a new cluster in
response to determining that the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 
3. The method as recited in claim 2 further comprising:
each of the plurality of existing nodes restoring informa- 
tion corresponding to said existing cluster in response 
to determining that the proposed new clusters specified 
in the cluster reconfiguration messages are not equiva- 
lent. 

4. The method as recited in claim 1 further comprising: 
the new node waiting a predetermined period of time for
each of the plurality of existing nodes to respond to the
reconfiguration petition message. 

5. The method as recited in claim 1 further comprising: 
each of the plurality of existing nodes and the new node
each independently deriving a corresponding proposed
new cluster. 

6. The method as recited in claim 5, wherein each of the 
plurality of existing nodes derives a corresponding proposed 
new cluster based on the existing cluster configuration 
information and information identifying said new node. 

7. The method as recited in claim 5, wherein the new node 
derives a corresponding proposed new cluster based on the 
existing cluster configuration information. 

8. A computer readable medium comprising instructions 
for adding a new node to an existing cluster of a distributed 
computer system, wherein the existing cluster includes a 
plurality of existing nodes, wherein the instructions are 
executable by the distributed computer system to implement 
a method comprising: 

the new node transmitting a reconfiguration petition mes- 
sage to each of the plurality of existing nodes; 

each of the plurality of existing nodes transmitting exist- 
ing cluster configuration information to the new node 
and to each of the remaining of the plurality of existing 
nodes in response to the reconfiguration petition mes- 
sage; 

each of the plurality of existing nodes transmitting a 
cluster reconfiguration message specifying a proposed 
new cluster to the new node and to each of the 
remaining of the plurality of existing nodes; 

the new node transmitting an additional cluster reconfigu- 
ration message specifying an additional proposed new 
cluster to each of the plurality of existing nodes; and 

each of the plurality of existing nodes and the new node 
each determining whether the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 

9. The computer readable medium as recited in claim 8, 
wherein the method further comprises: 

each of the plurality of existing nodes and the new node 
each storing information which defines a new cluster in 
response to determining that the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 

10. The computer readable medium as recited in claim 9, 
wherein the method further comprises: 

each of the plurality of existing nodes restoring informa- 
tion corresponding to said existing cluster in response 
to determining that the proposed new clusters specified 
in the cluster reconfiguration messages are not equiva- 
lent. 

11. The computer readable medium as recited in claim 8, 
wherein the method further comprises: 

the new node waiting a predetermined period of time for 
each of the plurality of existing nodes to respond to the 
reconfiguration petition message. 

12. The computer readable medium as recited in claim 8, 
wherein the method further comprises: 
each of the plurality of existing nodes and the new node
each independently deriving a corresponding proposed
new cluster. 

13. The computer readable medium as recited in claim 12, 
wherein each of the plurality of existing nodes derives a
corresponding proposed new cluster based on the existing
cluster configuration information and information identify- 
ing said new node. 

14. The computer readable medium as recited in claim 12, 
wherein the new node derives a corresponding proposed
new cluster based on the existing cluster configuration
information. 

15. A distributed computer system comprising: 

a plurality of nodes including a plurality of processors and 
memory;

wherein the memory includes instructions for adding a 
new node to an existing cluster including a plurality of 
existing nodes, wherein the instructions are executable 
by the plurality of nodes to implement a method of: 
the new node transmitting a reconfiguration petition 
message to each of the plurality of existing nodes; 
each of the plurality of existing nodes transmitting 
existing cluster configuration information to the new 
node and to each of the remaining of the plurality of 
existing nodes in response to the reconfiguration 
petition message; 
each of the plurality of existing nodes transmitting a 
cluster reconfiguration message specifying a pro- 
posed new cluster to the new node and to each of the 
remaining of the plurality of existing nodes;

the new node transmitting an additional cluster recon- 
figuration message specifying an additional pro- 
posed new cluster to each of the plurality of existing 
nodes; and 

each of the plurality of existing nodes and the new node 
each determining whether the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 

16. The distributed computer system as recited in claim 
15, wherein the method further comprises:

each of the plurality of existing nodes and the new node 
each storing information which defines a new cluster in 
response to determining that the proposed new clusters 
specified in the cluster reconfiguration messages are 
equivalent. 

17. The distributed computer system as recited in claim 
16, wherein the method further comprises: 

each of the plurality of existing nodes restoring informa- 
tion corresponding to said existing cluster in response 
to determining that the proposed new clusters specified
in the cluster reconfiguration messages are not equiva- 
lent. 

18. The distributed computer system as recited in claim 
16, wherein the method further comprises: 

the new node waiting a predetermined period of time for
each of the plurality of existing nodes to respond to the 
reconfiguration petition message. 

19. The distributed computer system as recited in claim 
15, wherein the method further comprises: 

each of the plurality of existing nodes and the new node
each independently deriving a corresponding proposed 
new cluster. 

20. The distributed computer system as recited in claim 
19, wherein each of the plurality of existing nodes derives a
corresponding proposed new cluster based on the existing
cluster configuration information and information identify- 
ing said new node. 
21. The distributed computer system as recited in claim
19, wherein the new node derives a corresponding proposed 
new cluster based on the existing cluster configuration 
information. 

22. A method for adding a new node to an existing cluster
of a distributed computer system, wherein the existing 
cluster includes at least a first node and a second node, said 
method comprising: 

the new node transmitting a reconfiguration petition message
to the first node and the second node;

the first node transmitting existing cluster configuration 
information to the new node and to the second node in 
response to the reconfiguration petition message; 

the second node transmitting existing cluster configuration
information to the new node and to the first node
in response to the reconfiguration petition message; 

the first node transmitting a first cluster reconfiguration 
message specifying a first proposed new cluster to the 
new node and to the second node; 

the second node transmitting a second cluster reconfiguration
message specifying a second proposed new
cluster to the new node and to the first node; 

the new node transmitting a third cluster reconfiguration 
message specifying a third proposed new cluster to the 
first node and the second node; and

the first node, the second node, and the new node each 
determining whether the first proposed new cluster, the 
second proposed new cluster, and the third proposed 
new cluster specified in the cluster reconfiguration 
messages are equivalent.

23. The method as recited in claim 22 further comprising: 
the first node, the second node, and the new node each
storing information which defines a new cluster in
response to determining that the first proposed new
cluster, the second proposed new cluster, and the third 
proposed new cluster specified in the cluster reconfigu- 
ration messages are equivalent. 

24. The method as recited in claim 23 further comprising: 
the first node and the second node restoring information
corresponding to said existing cluster in response to
determining that the first proposed new cluster, the 
second proposed new cluster, and the third proposed 
new cluster specified in the cluster reconfiguration 
messages are not equivalent.

25. The method as recited in claim 22 further comprising: 
the new node waiting a predetermined period of time for
the first node and the second node to respond to the
reconfiguration petition message. 
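
The bounded wait of claim 25 might be sketched as follows. The sketch assumes that responses arrive on an in-process `queue.Queue` as `(sender, config)` pairs; the function name `collect_responses`, the message shape, and the timeout value are all illustrative assumptions, not part of the claimed method.

```python
import queue
import time
from typing import Dict, Set


def collect_responses(inbox: "queue.Queue", expected: Set[str],
                      timeout_s: float) -> Dict[str, dict]:
    """Gather responses from the expected nodes until all have replied
    or the predetermined period elapses, whichever comes first."""
    deadline = time.monotonic() + timeout_s
    responses: Dict[str, dict] = {}
    while set(responses) != expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # the predetermined period has expired
        try:
            sender, config = inbox.get(timeout=remaining)
        except queue.Empty:
            break  # nothing further arrived within the period
        if sender in expected:
            responses[sender] = config
    return responses


# Only node1 responds; node2 stays silent, so the wait ends at the deadline.
inbox: "queue.Queue" = queue.Queue()
inbox.put(("node1", {"members": ["node1", "node2"]}))
got = collect_responses(inbox, {"node1", "node2"}, timeout_s=0.05)
```

A node that has not responded by the deadline is simply absent from the returned mapping, which leaves the caller free to decide how to treat it.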

26. The method as recited in claim 22 further comprising:
the first node deriving the first proposed new cluster
independent of the second and third proposed new
clusters;

the second node deriving the second proposed new cluster
independent of the first and third proposed new
clusters; and

the new node deriving the third proposed new cluster 
independent of the first and second proposed new 
clusters. 

27. The method as recited in claim 26, wherein the first
node derives the first proposed new cluster depending upon 
the existing cluster configuration information and informa- 
tion identifying said new node. 

28. The method as recited in claim 26, wherein the second 
node derives the second proposed new cluster depending
upon the existing cluster configuration information and 
information identifying said new node. 



29. The method as recited in claim 26, wherein the new 
node derives the third proposed new cluster depending upon 
the existing cluster configuration information. 
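
Claims 26 through 29 recite that each node derives its own proposal independently from the inputs available to it. A minimal sketch follows, under the assumptions that membership is a set of identifiers and that the new node includes its own identifier in its proposal; the claims do not state how the new node accounts for itself. Both function names are hypothetical.

```python
from typing import FrozenSet, Iterable


def derive_as_member(existing: FrozenSet[str], new_node: str) -> FrozenSet[str]:
    """An existing member combines the existing configuration with the
    identity of the petitioning node (claims 27 and 28)."""
    return existing | {new_node}


def derive_as_new_node(received: Iterable[FrozenSet[str]],
                       self_id: str) -> FrozenSet[str]:
    """The new node builds its proposal from the existing configuration it
    received (claim 29). Adding self_id is an assumption of this sketch."""
    members = set()
    for config in received:
        members |= config
    members.add(self_id)
    return frozenset(members)


existing = frozenset({"node1", "node2"})
first_proposal = derive_as_member(existing, "new")
second_proposal = derive_as_member(existing, "new")
third_proposal = derive_as_new_node([existing, existing], "new")
```

Because no derivation reads another node's proposal, agreement among the three results reflects a consistent view of the inputs rather than one node copying another.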

30. A computer readable medium comprising instructions 
for adding a new node to an existing cluster of a distributed 
computer system, which includes at least a first node and a 
second node, wherein the instructions are executable by the 
distributed computer system to implement a method com- 
prising: 

the new node transmitting a reconfiguration petition mes- 
sage to the first node and the second node; 

the first node transmitting existing cluster configuration 
information to the new node and to the second node in 
response to the reconfiguration petition message; 

the second node transmitting existing cluster configura- 
tion information to the new node and to the first node 
in response to the reconfiguration petition message; 

the first node transmitting a first cluster reconfiguration 
message specifying a first proposed new cluster to the 
new node and to the second node; 

the second node transmitting a second cluster reconfigu- 
ration message specifying a second proposed new 
cluster to the new node and to the first node; 

the new node transmitting a third cluster reconfiguration 
message specifying a third proposed new cluster to the 
first node and the second node; and 

the first node, the second node, and the new node each 
determining whether the first proposed new cluster, the 
second proposed new cluster, and the third proposed 
new cluster specified in the cluster reconfiguration 
messages are equivalent. 

31. The computer readable medium as recited in claim 30, 
wherein the method further comprises: 

the first node, the second node, and the new node each 
storing information which defines a new cluster in 
response to determining that the first proposed new 
cluster, the second proposed new cluster, and the third 
proposed new cluster specified in the cluster reconfigu- 
ration messages are equivalent. 

32. The computer readable medium as recited in claim 31, 
wherein the method further comprises: 

the first node and the second node restoring information 
corresponding to said existing cluster in response to 
determining that the first proposed new cluster, the 
second proposed new cluster, and the third proposed 
new cluster specified in the cluster reconfiguration 
messages are not equivalent. 

33. The computer readable medium as recited in claim 30, 
wherein the method further comprises: 

the new node waiting a predetermined period of time for 
the first node and the second node to respond to the 
reconfiguration petition message. 

34. The computer readable medium as recited in claim 30, 
wherein the method further comprises: 

the first node deriving the first proposed new cluster 
independent of the second and third proposed new 
clusters; 

the second node deriving the second proposed new cluster 
independent of the first and third proposed new clus- 
ters; and 

the new node deriving the third proposed new cluster 
independent of the first and second proposed new 
clusters. 

35. The computer readable medium as recited in claim 34, 
wherein the first node derives the first proposed new cluster 
depending upon the existing cluster configuration
information and information identifying said new node.

36. The computer readable medium as recited in claim 35, 
wherein the second node derives the second proposed new 
cluster depending upon the existing cluster configuration 
information and information identifying said new node. 

37. The computer readable medium as recited in claim 36, 
wherein the new node derives the third proposed new cluster 
depending upon the existing cluster configuration informa- 
tion. 

38. A distributed computer system comprising: 

a plurality of nodes including a plurality of processors and 
memory; 

wherein the plurality of nodes includes at least a first node
coupled to a second node, and a new node coupled to
the second node;

wherein the memory includes instructions for adding the
new node to an existing cluster including said first node
and second node;

wherein the instructions are executable by the plurality of
nodes to implement a method of:

the new node transmitting a reconfiguration petition 
message to the first node and the second node; 

the first node transmitting existing cluster configuration 
information to the new node and to the second node 
in response to the reconfiguration petition message; 

the second node transmitting existing cluster configu- 
ration information to the new node and to the first 
node in response to the reconfiguration petition mes- 
sage; 

the first node transmitting a first cluster reconfiguration 
message specifying a first proposed new cluster to 
the new node and to the second node; 

the second node transmitting a second cluster recon- 
figuration message specifying a second proposed 
new cluster to the new node and to the first node; 

the new node transmitting a third cluster reconfigura- 
tion message specifying a third proposed new cluster 
to the first node and the second node; and 

the first node, the second node, and the new node each 
determining whether the first proposed new cluster, 
the second proposed new cluster, and the third pro- 
posed new cluster specified in the cluster reconfigu- 
ration messages are equivalent. 
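
The message exchange recited in claim 38 can be simulated end to end in a few lines. This sketch borrows from the abstract the idea that each node proposes the set of nodes whose reconfiguration messages it received. The function `run_reconfiguration` and the `heard` mapping are hypothetical constructs for illustration and do not model real network transport.

```python
from typing import Dict, FrozenSet, Set


def run_reconfiguration(existing: FrozenSet[str],
                        heard: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Return the accepted cluster: the common proposal if every node's
    proposal is equivalent, otherwise the unchanged existing cluster.

    heard[n] is the set of peers whose messages node n received."""
    # Each node proposes itself plus every peer it heard from.
    proposals = {node: frozenset(peers | {node}) for node, peers in heard.items()}
    if not proposals:
        return existing
    reference = next(iter(proposals.values()))
    if all(p == reference for p in proposals.values()):
        return reference
    return existing


existing = frozenset({"node1", "node2"})

# Every node hears every other node, so all three proposals agree.
full = {"node1": {"node2", "new"},
        "node2": {"node1", "new"},
        "new": {"node1", "node2"}}
accepted = run_reconfiguration(existing, full)

# node1 never hears the new node, so its proposal differs and the join fails.
partial = {"node1": {"node2"},
           "node2": {"node1", "new"},
           "new": {"node1", "node2"}}
unchanged = run_reconfiguration(existing, partial)
```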

39. The distributed computer system as recited in claim 
38, wherein the method further comprises: 
the first node, the second node, and the new node each
storing information which defines a new cluster in
response to determining that the first proposed new
cluster, the second proposed new cluster, and the third
proposed new cluster specified in the cluster reconfiguration
messages are equivalent.

40. The distributed computer system as recited in claim 
39, wherein the method further comprises: 

the first node and the second node restoring information
corresponding to said existing cluster in response to
determining that the first proposed new cluster, the
second proposed new cluster, and the third proposed
new cluster specified in the cluster reconfiguration
messages are not equivalent.

41. The distributed computer system as recited in claim 
38, wherein the method further comprises: 

the new node waiting a predetermined period of time for 
the first node and the second node to respond to the 
reconfiguration petition message.

42. The distributed computer system as recited in claim 
38, wherein the method further comprises: 

the first node deriving the first proposed new cluster 
independent of the second and third proposed new
clusters; 

the second node deriving the second proposed new cluster 
independent of the first and third proposed new clus- 
ters; and 

the new node deriving the third proposed new cluster
independent of the first and second proposed new 
clusters. 

43. The distributed computer system as recited in claim
42, wherein the first node derives the first proposed new
cluster depending upon the existing cluster configuration
information and information identifying said new node.

44. The distributed computer system as recited in claim
42, wherein the second node derives the second proposed
new cluster depending upon the existing cluster configuration
information and information identifying said new node.

45. The distributed computer system as recited in claim
42, wherein the new node derives the third proposed new
cluster depending upon the existing cluster configuration
information.